Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The motivation for this project is to become familiar with the text data and to take the first steps toward building a predictive text model.
library(stringr)
library(quanteda)
library(readtext)
library(R.utils)
library(ggplot2)
set.seed(3301)
Tasks to accomplish
Questions to consider
Download data.
source("downloadData.R")
attach(downloadData(file.path("..", "data")))
c(blogs, twitter, news, badwords)
## [1] "../data/final/en_US/en_US.blogs.txt"
## [2] "../data/final/en_US/en_US.twitter.txt"
## [3] "../data/final/en_US/en_US.news.txt"
## [4] "../data/bad-words.txt"
First, load the files using my own naive line-by-line implementation:
tweets <- 0
wordsTwitter <- 0
sentencesTwitter <- 0
con <- file(twitter, "r")
while (length(oneLine <- readLines(con, 1, warn = FALSE)) > 0) {
tweets <- tweets + 1
if(tweets <= 10) {
print(oneLine)
}
words <- str_split(oneLine, "\\s+")[[1]]
symbols <- rep(FALSE, length = length(words))
for(i in seq_along(words)) {
symbols[i] <- grepl("^[^a-zA-Z0-9]+$", words[i])
if(grepl("^[0-9]+$", words[i])) {
words[i] <- "[numbers]"
}
}
wordsPerLine <- length(simpleWords <- words[!symbols])
for(i in seq_along(simpleWords)){
if(grepl("[.!?]$", simpleWords[i])) {
sentencesTwitter <- sentencesTwitter + 1
}
}
wordsTwitter <- wordsTwitter + wordsPerLine
remove(simpleWords, words)
}
close(con)
tweets
wordsTwitter
sentencesTwitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [1] "they've decided its more fun if I don't."
## [1] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [1] "Words from a complete stranger! Made my birthday even better :)"
## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [1] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [1] "I'm coo... Jus at work hella tired r u ever in cali"
## [1] "The new sundrop commercial ...hehe love at first sight"
## [1] "we need to reconnect THIS WEEK"
## [1] 2360148
## [1] 29706404
## [1] 2818583
Next, load the files using the readtext package:
tweetFile <- readtext(twitter)
corpusTwitter <- corpus(tweetFile, cache = FALSE)
summary(corpusTwitter)
## Corpus consisting of 1 document:
##
## Text Types Tokens Sentences
## en_US.twitter.txt 566951 36719658 2588551
##
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Sun Jul 29 03:34:49 2018
## Notes:
Tips, tricks, and hints
Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R. For example, the following code could be used to read the first few lines of the English Twitter dataset:
con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done
See the ?connections help page for more information.
Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.
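The biased-coin idea can be illustrated in memory before touching the real files. This is a hypothetical sketch: the `lines` vector and `targetSize` are stand-ins, not values from the report.

```r
# Keep each line independently with probability targetSize / total lines;
# the resulting sample size is binomial, so it lands near (not exactly at)
# the target.
set.seed(42)
lines <- sprintf("line %d", 1:100000)   # stand-in for the real file contents
targetSize <- 1000
keep <- rbinom(length(lines), size = 1, prob = targetSize / length(lines)) == 1
sampled <- lines[keep]
length(sampled)  # close to 1000
```

Writing `sampled` out with `writeLines` would then give a reusable sub-sample file, which is what the code below does against the actual Twitter file.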
Sub-Sampling.
twitterSubSampling <- paste0(twitter, ".sub-sampling.txt")
if(!file.exists(twitterSubSampling)) {
subSamplingSize <- 10000
flipABiasedCoin <- rbinom(tweets, size = 1, prob = subSamplingSize / tweets)
conRead <- file(twitter, "r")
conWrite <- file(twitterSubSampling, "w")
len <- 0
while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
len <- len + 1
if(flipABiasedCoin[len] == 1) {
writeLines(oneLine, conWrite)
}
}
close(conRead)
close(conWrite)
}
subTweets <- as.numeric(countLines(twitterSubSampling))
subTweets
## [1] 9970
Tokenization.
subTweetFile <- readtext(twitterSubSampling)
subTwitterCorpus <- corpus(subTweetFile, cache = FALSE)
summary(subTwitterCorpus)
## Corpus consisting of 1 document:
##
## Text Types Tokens Sentences
## en_US.twitter.txt.sub-sampling.txt 20110 154790 10982
##
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Sun Jul 29 03:36:28 2018
## Notes:
Load the bad-words list used for profanity filtering.
profanity <- readLines(badwords)
| Field | Unit | Sample sequence | 1-gram sequence | 2-gram sequence | 3-gram sequence |
|---|---|---|---|---|---|
| Computational linguistics | word | … to be or not to be … | …, to, be, or, not, to, be, … | …, to be, be or, or not, not to, to be, … | …, to be or, be or not, or not to, not to be, … |
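The sequences in the table can be reproduced with a few lines of base R. The `ngrams` helper below is an illustrative sketch; the report itself relies on quanteda's `tokens_ngrams()`.

```r
# Slide a window of size n over the word sequence and join each window
# with "_" (the same separator quanteda uses for n-gram features).
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = "_"))
}

words <- c("to", "be", "or", "not", "to", "be")
ngrams(words, 2)  # "to_be" "be_or" "or_not" "not_to" "to_be"
ngrams(words, 3)  # "to_be_or" "be_or_not" "or_not_to" "not_to_be"
```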
Top 20.
subTweetsDfm <- dfm(subTwitterCorpus)
topfeatures(subTweetsDfm, 20)
## . ! the to , i a you and ? : for
## 10753 5448 3830 3312 3172 3063 2657 2265 1809 1787 1695 1660
## in of is " my it on that
## 1596 1531 1493 1309 1307 1264 1218 927
Plot word cloud.
subTweetsDfm %>%
dfm_trim(min_termfreq = 10,
verbose = FALSE) %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Normalize words.
subTweetsDfmNomarized <- subTwitterCorpus %>%
# normalize words
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# remove stopwords and profanity
tokens_remove(stopwords('english')) %>%
tokens_remove(profanity)
Top 20 normalized words.
topfeatures(dfm(subTweetsDfmNomarized), 20)
## just like get love good day u rt can thanks
## 641 494 470 448 427 402 373 367 366 361
## now one time great know today new lol go see
## 342 330 328 311 308 300 272 270 268 261
Plot word cloud.
dfm(subTweetsDfmNomarized) %>%
dfm_trim(min_termfreq = 10,
verbose = FALSE) %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
max_words = 100,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Frequency Plots
featuresTweetsDfm <- textstat_frequency(dfm(subTweetsDfmNomarized), n = 80)
# Sort by reverse frequency order
featuresTweetsDfm$feature <- with(featuresTweetsDfm, reorder(feature, -frequency))
ggplot(featuresTweetsDfm, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
2-Gram
subTweetsDfmNomarized2Gram <- subTwitterCorpus %>%
# normalize words
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# remove profanity (stopword removal is skipped here so 2-grams keep function words)
# tokens_remove(stopwords('english')) %>%
tokens_remove(profanity) %>%
tokens_ngrams(n = 2)
topfeatures(dfm(subTweetsDfmNomarized2Gram), 20)
## in_the for_the of_the on_the to_be going_to
## 319 290 243 218 187 165
## thanks_for to_the i_love thank_you if_you for_a
## 164 158 156 155 155 144
## at_the have_a i_am is_a to_get will_be
## 141 137 128 126 117 116
## i_was i_have
## 111 111
dfm(subTweetsDfmNomarized2Gram) %>%
dfm_trim(min_termfreq = 10,
verbose = FALSE) %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
max_words = 100,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Frequency Plots
featuresTweetsDfm2Gram <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram), n = 80)
# Sort by reverse frequency order
featuresTweetsDfm2Gram$feature <- with(featuresTweetsDfm2Gram, reorder(feature, -frequency))
ggplot(featuresTweetsDfm2Gram, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
3-Gram
subTweetsDfmNomarized3Gram <- subTwitterCorpus %>%
# normalize words
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# remove stopwords and profanity
tokens_remove(stopwords('english')) %>%
tokens_remove(profanity) %>%
tokens_ngrams(n = 3)
topfeatures(dfm(subTweetsDfmNomarized3Gram), 20)
## happy_mother's_day happy_new_year happy_mothers_day
## 10 10 8
## la_la_la looking_forward_seeing let_us_know
## 8 7 6
## please_follow_back good_right_now cinco_de_mayo
## 4 4 4
## please_please_please feel_better_soon today_good_day
## 4 3 3
## just_got_back merry_christmas_happy come_join_us
## 3 3 3
## happy_st_patrick's st_patrick's_day cant_wait_hear
## 3 3 3
## run_time_nike time_nike_gps
## 3 3
dfm(subTweetsDfmNomarized3Gram) %>%
textplot_wordcloud(min_count = 4,
random_order = FALSE,
max_words = 50,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Frequency Plots
featuresTweetsDfm3Gram <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram), 60)
# Sort by reverse frequency order
featuresTweetsDfm3Gram$feature <- with(featuresTweetsDfm3Gram, reorder(feature, -frequency))
ggplot(featuresTweetsDfm3Gram, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
1-gram 90th percentile:
featuresTweetsDfmFull <- textstat_frequency(dfm(subTweetsDfmNomarized))
summary(featuresTweetsDfmFull)
## feature frequency rank docfreq
## Length:15408 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.: 3853 1st Qu.:1
## Mode :character Median : 1.000 Median : 7704 Median :1
## Mean : 4.481 Mean : 7704 Mean :1
## 3rd Qu.: 2.000 3rd Qu.:11556 3rd Qu.:1
## Max. :641.000 Max. :15408 Max. :1
## group
## Length:15408
## Class :character
## Mode :character
##
##
##
quantile(featuresTweetsDfmFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 7 641
2-gram 90th percentile:
featuresTweetsDfm2GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram))
summary(featuresTweetsDfm2GramFull)
## feature frequency rank docfreq
## Length:77857 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.:19465 1st Qu.:1
## Mode :character Median : 1.000 Median :38929 Median :1
## Mean : 1.576 Mean :38929 Mean :1
## 3rd Qu.: 1.000 3rd Qu.:58393 3rd Qu.:1
## Max. :319.000 Max. :77857 Max. :1
## group
## Length:77857
## Class :character
## Mode :character
##
##
##
quantile(featuresTweetsDfm2GramFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 2 319
3-gram 90th percentile:
featuresTweetsDfm3GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram))
summary(featuresTweetsDfm3GramFull)
## feature frequency rank docfreq
## Length:68757 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.:17190 1st Qu.:1
## Mode :character Median : 1.000 Median :34379 Median :1
## Mean : 1.004 Mean :34379 Mean :1
## 3rd Qu.: 1.000 3rd Qu.:51568 3rd Qu.:1
## Max. :10.000 Max. :68757 Max. :1
## group
## Length:68757
## Class :character
## Mode :character
##
##
##
quantile(featuresTweetsDfm3GramFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 1 10
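These quantiles suggest a related coverage question: how many top-ranked features are needed to account for a given share of all token instances? A hypothetical base-R sketch — `coverageCount` and the toy `freqs` vector are illustrative, not values taken from the corpus:

```r
# Smallest number of top-ranked features whose cumulative frequency
# reaches fraction p of all instances.
coverageCount <- function(freqs, p) {
  sorted <- sort(freqs, decreasing = TRUE)
  which(cumsum(sorted) / sum(sorted) >= p)[1]
}

freqs <- c(641, 494, 470, 5, 3, 1, 1, 1)  # made-up frequency vector
coverageCount(freqs, 0.5)  # 2: the top two features cover >= 50%
```

Applied to `featuresTweetsDfmFull$frequency`, the same function would answer the course question about how many unique words cover 50% or 90% of word instances.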
ntoken(subTweetsDfmNomarized)
## en_US.twitter.txt.sub-sampling.txt
## 69046
ntype(subTweetsDfmNomarized)
## en_US.twitter.txt.sub-sampling.txt
## 18775
Basic 2-gram model:
nextWords2Gram <- function(input) {
featuresNextWord <- NULL
nextWordDfm <- dfm(
tokens_select(
subTweetsDfmNomarized2Gram,
paste0("^", input, "_.*"),
valuetype ="regex"))
if(nfeat(nextWordDfm) > 0) {
featuresNextWord <- textstat_frequency(nextWordDfm, n = 5)
featuresNextWord$feature <-
sapply(as.vector(featuresNextWord$feature),
function(x){
str_split(x, "_")[[1]][2]
})
# Sort by reverse frequency order
featuresNextWord$feature <- with(featuresNextWord, reorder(feature, -frequency))
} else {
# un-seen n-gram.
# backoff models?
}
featuresNextWord
}
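For the unseen-n-gram case flagged in the comment above, one simple option is a backoff to unigram frequencies. The sketch below uses made-up frequency tables; in the report these would come from `textstat_frequency()` on the 2-gram and 1-gram dfms.

```r
# Toy frequency tables standing in for the real textstat_frequency output.
bigramFreq  <- c(looking_forward = 25, looking_good = 4, to_be = 187)
unigramFreq <- c(just = 641, like = 494, get = 470)

# If any bigram starts with the input word, rank its continuations;
# otherwise back off to the most frequent unigrams.
nextWordBackoff <- function(input, n = 3) {
  hits <- bigramFreq[grepl(paste0("^", input, "_"), names(bigramFreq))]
  if (length(hits) > 0) {
    cand <- sort(hits, decreasing = TRUE)[seq_len(min(n, length(hits)))]
    sub(paste0("^", input, "_"), "", names(cand))
  } else {
    names(sort(unigramFreq, decreasing = TRUE))[seq_len(min(n, length(unigramFreq)))]
  }
}

nextWordBackoff("looking")  # "forward" "good"
nextWordBackoff("zzz")      # unseen prefix, falls back: "just" "like" "get"
```

A full backoff model (e.g. Katz or "stupid backoff") would also discount the higher-order counts rather than switch tables abruptly, but this shows the shape of the idea.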
Next word after "Looking":
ggplot(nextWords2Gram("Looking"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
Next word after "forward":
ggplot(nextWords2Gram("forward"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
ggplot(nextWords2Gram("went"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
ggplot(nextWords2Gram("to"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
ggplot(nextWords2Gram("be"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
ggplot(nextWords2Gram("a"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
ggplot(nextWords2Gram("great"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")